126 research outputs found
5,13-Disulfamoyl-1,9-diazatetracyclo[7.7.1.02,7.010,15]heptadeca-2(7),3,5,10,12,14-hexaen-1-ium chloride
In the title salt, C15H17N4O4S2+·Cl−, the chloride anion is disordered over two positions with occupancies of 0.776 (6) and 0.224 (6). The cation adopts an L shape and the dihedral angle between the benzene rings is 82.5 (3)°. In the crystal, inversion dimers of cations linked by pairs of N—H⋯N hydrogen bonds occur, with the bond arising from the protonated N atom. The cationic dimers are linked into chains via the disordered chloride ions by way of N—H⋯Cl hydrogen bonds, and N—H⋯O, C—H⋯O and C—H⋯Cl interactions also occur, which help to consolidate the three-dimensional network.
Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots
Improving the generalization capabilities of general-purpose robotic agents
has long been a significant challenge actively pursued by research communities.
Existing approaches often rely on collecting large-scale real-world robotic
data, such as the RT-1 dataset. However, these approaches typically suffer from
low efficiency, limiting their capability in open-domain scenarios with new
objects and diverse backgrounds. In this paper, we propose a novel paradigm
that effectively leverages language-grounded segmentation masks generated by
state-of-the-art foundation models, to address a wide range of pick-and-place
robot manipulation tasks in everyday scenarios. By integrating precise
semantics and geometries conveyed from masks into our multi-view policy model,
our approach can perceive accurate object poses and enable sample-efficient
learning. Moreover, this design facilitates effective generalization to
grasping new objects whose shapes resemble those seen during training. Our approach
consists of two distinct steps. First, we introduce a series of foundation
models to accurately ground natural language demands across multiple tasks.
Second, we develop a Multi-modal Multi-view Policy Model that incorporates
inputs such as RGB images, semantic masks, and robot proprioception states to
jointly predict precise and executable robot actions. Extensive real-world
experiments conducted on a Franka Emika robot arm validate the effectiveness of
our proposed paradigm. Real-world demos are shown in YouTube
(https://www.youtube.com/watch?v=1m9wNzfp_4E ) and Bilibili
(https://www.bilibili.com/video/BV178411Z7H2/ )
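As a toy illustration of the fusion described above (language-grounded masks combined with multi-view RGB and proprioception to predict an action), the following sketch uses only NumPy; every function name, dimension, and weight is an illustrative assumption, not the paper's actual architecture:

```python
import numpy as np

def encode_view(rgb, mask):
    """Toy per-view encoder: zero out background pixels using the
    language-grounded mask, then pool to a fixed-length feature.
    (Illustrative stand-in for a learned visual backbone.)"""
    masked = rgb * mask[..., None]                        # keep only the grounded object
    return masked.reshape(-1, rgb.shape[-1]).mean(axis=0) # (3,) pooled feature

def policy(views, proprio, W):
    """Fuse features from all camera views with robot proprioception and
    map them linearly to an action (e.g., a 7-DoF end-effector command)."""
    feats = [encode_view(rgb, m) for rgb, m in views]
    x = np.concatenate(feats + [proprio])  # joint multi-modal feature vector
    return W @ x                           # linear "action head"

rng = np.random.default_rng(0)
views = [(rng.random((64, 64, 3)), (rng.random((64, 64)) > 0.5).astype(float))
         for _ in range(2)]                # two camera views with semantic masks
proprio = rng.random(7)                    # joint angles / gripper state
W = rng.random((7, 2 * 3 + 7))             # hypothetical action-head weights
action = policy(views, proprio, W)
print(action.shape)                        # (7,)
```

The point of the sketch is the data flow: masks filter each view before pooling, so the pooled features carry object-specific geometry into the action head.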
AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation
We propose a novel framework for learning high-level cognitive capabilities
in robot manipulation tasks, such as making a smiley face using building
blocks. These tasks often involve complex multi-step reasoning, presenting
significant challenges due to the limited paired data connecting human
instructions (e.g., making a smiley face) and robot actions (e.g., end-effector
movement). Existing approaches alleviate this challenge by adopting an
open-loop paradigm: they decompose high-level instructions into simple
sub-task plans and execute them step by step using low-level control models.
However, these approaches lack instant observation feedback during multi-step
reasoning, leading to sub-optimal results. To address this issue, we propose
to automatically collect a cognitive robot dataset using Large Language
Models (LLMs). The
resulting dataset AlphaBlock consists of 35 comprehensive high-level tasks of
multi-step text plans and paired observation sequences. To enable efficient
data acquisition, we employ elaborated multi-round prompt designs that
effectively reduce the burden of extensive human involvement. We further
propose a closed-loop multi-modal embodied planning model that autoregressively
generates plans by taking image observations as input. To facilitate effective
learning, we leverage MiniGPT-4 with a frozen visual encoder and LLM, and
fine-tune an additional vision adapter and Q-former to enable fine-grained spatial
perception for manipulation tasks. Experiments verify the superiority of our
method over existing open- and closed-loop baselines, with success-rate
improvements of 21.4% and 14.5% over ChatGPT- and GPT-4-based counterparts,
respectively. Real-world demos are shown in
https://www.youtube.com/watch?v=ayAzID1_qQk
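The closed-loop idea above — re-observing the scene after every executed sub-step so later plan steps can react to it — can be sketched as follows; the environment and planner here are hypothetical stand-ins, not the AlphaBlock implementation:

```python
def closed_loop_plan(instruction, env, planner, max_steps=10):
    """Closed-loop planning sketch: after each executed sub-step the
    planner receives a fresh observation, unlike an open-loop pipeline
    that commits to the full plan up front."""
    history, obs = [], env.observe()
    for _ in range(max_steps):
        step = planner(instruction, obs, history)  # autoregressive next sub-task
        if step == "done":
            break
        env.execute(step)
        history.append(step)
        obs = env.observe()                        # instant observation feedback
    return history

# Minimal toy environment and planner to exercise the loop.
class ToyEnv:
    def __init__(self):
        self.placed = 0
    def observe(self):
        return self.placed
    def execute(self, step):
        self.placed += 1

def toy_planner(instruction, obs, history):
    return "done" if obs >= 3 else f"place block {obs + 1}"

plan = closed_loop_plan("make a smiley face", ToyEnv(), toy_planner)
print(plan)  # ['place block 1', 'place block 2', 'place block 3']
```

Because each step is conditioned on the current observation, a failed or perturbed sub-step would change what the planner emits next, which is exactly what the open-loop decomposition cannot do.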
3,5-Dimethyl-1H-pyrazole–2-hydroxy-5-(phenyldiazenyl)benzoic acid (1/1)
There are two independent 3,5-dimethylpyrazole and two independent 2-hydroxy-5-(phenyldiazenyl)benzoic acid molecules [in which intramolecular O—H⋯O bonds form S(6) graph-set motifs] in the asymmetric unit of the title compound, C5H8N2·C13H10N2O3. In the crystal, the components are linked by intermolecular O—H⋯O, O—H⋯N and N—H⋯O hydrogen bonds, forming four-component clusters. Further stabilization is provided by weak C—H⋯π interactions.
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
We propose the first joint audio-video generation framework that delivers
engaging watching and listening experiences simultaneously, producing
high-quality, realistic videos. To generate joint audio-video pairs, we
propose a novel Multi-Modal Diffusion model (MM-Diffusion) with two coupled
denoising autoencoders. In contrast to existing single-modal diffusion models,
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising
process by design. Two subnets for audio and video learn to gradually generate
aligned audio-video pairs from Gaussian noise. To ensure semantic consistency
across modalities, we propose a novel random-shift-based attention block
bridging the two subnets, which enables efficient cross-modal alignment and
thus mutually reinforces audio and video fidelity. Extensive
experiments show superior results in unconditional audio-video generation, and
zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve
the best FVD and FAD scores on the Landscape and AIST++ dancing datasets. Turing tests with
10k votes further demonstrate dominant preferences for our model. The code and
pre-trained models can be downloaded at
https://github.com/researchmm/MM-Diffusion. Comment: Accepted by CVPR 2023.
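A rough NumPy sketch of the random-shift cross-modal attention idea — one modality attends to a randomly shifted copy of the other, rather than to every temporal position — is given below; the shapes, shift scheme, and single-head linear form are simplifying assumptions, not the paper's exact block:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def random_shift_attention(video, audio, rng):
    """Cross-modal attention sketch: video frames attend to a randomly
    shifted version of the audio features, encouraging alignment that is
    robust to temporal offset (illustrative simplification)."""
    T, d = video.shape
    shift = rng.integers(0, T)                      # random temporal offset
    audio_shifted = np.roll(audio, shift, axis=0)
    scores = video @ audio_shifted.T / np.sqrt(d)   # (T, T) attention logits
    attn = softmax(scores, axis=-1)                 # rows sum to 1
    return attn @ audio_shifted                     # audio-conditioned video features

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))  # 8 frames, 16-dim features per frame
audio = rng.standard_normal((8, 16))  # matching audio feature sequence
out = random_shift_attention(video, audio, rng)
print(out.shape)  # (8, 16)
```

In the full model this block would bridge the two denoising subnets in both directions; the sketch shows only the video-attends-to-audio half.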